ABSTRACT

If  relational  data  contain  communities — groups  of  inter-related 
items  with  similar  attribute  values — a  clustering  technique 
that  considers  attribute  information  and  the  structure  of 
relations  simultaneously  should  produce  more  meaningful 
clusters  than  those  produced  by  considering  attributes  alone. 
We  investigate  this  hypothesis  in  the  context  of  a  spectral 
graph  partitioning  technique,  considering  a  number  of  hy¬ 
brid  similarity  metrics  that  combine  both  sources  of  infor¬ 
mation.  Through  simulation,  we  find  that  two  of  the  hybrid 
metrics  achieve  superior  performance  over  a  wide  range  of 
data  characteristics.  We  analyze  the  spectral  decomposition 
algorithm  from  a  statistical  perspective  and  show  that  the 
successful  hybrid  metrics  exaggerate  the  separation  between 
cluster  similarity  values,  at  the  expense  of  increased  vari¬ 
ance.  We  cluster  several  relational  datasets  using  the  best 
hybrid  metric  and  show  that  the  resulting  clusters  exhibit 
significant  community  structure,  and  that  they  significantly 
improve  performance  in  a  related  classification  task. 

Categories  and  Subject  Descriptors 

I.5.3 [Pattern Recognition]: Clustering

Keywords 

Clustering,  Relational  Learning,  Spectral  Analysis 

1.  INTRODUCTION 

Spectral  clustering  techniques,  which  partition  data  into  dis¬ 
joint  clusters  using  the  eigenstructure  of  a  similarity  ma¬ 
trix,  have  been  successfully  applied  in  a  number  of  domains, 
including image segmentation [19] and document clustering [5].
Finding an optimal partition is in general NP-complete, but the
eigenvectors of the matrix provide some information that can be
used to guide an approximate solution.
Experimental evidence has shown that this heuristic approach often
works well in practice and has prompted further investigation into
the properties of spectral clustering. Recent findings, facilitated
by a long history of work in spectral graph theory (e.g., [2]),
include a connection to random walks [13] and preliminary
performance analysis [10, 16].
In  this  paper,  we  investigate  methods  of  adapting  spectral 
clustering  techniques  to  relational  domains. 

The  goal  of  this  work  is  to  find  communities  in  relational 
data  represented  as  an  attributed  graph  G  =  (V,E,X), 
where  the  nodes  V  represent  objects  in  the  data  (e.g.,  genes), 
the  edges  E  represent  relations  among  the  objects  (e.g.,  in¬ 
teractions),  and  the  attributes  X  record  data  about  each  ob¬ 
ject  (e.g.,  localization).  Community  clusters  identify  groups 
of  objects  that  have  similar  attributes  and  are  also  highly 
inter-related.  For  example  in  genomic  data,  a  group  of  genes 
with  similar  attributes  and  many  common  interactions  may 
all  be  involved  in  a  similar  function  in  the  cell.  The  underly¬ 
ing  assumption  is  that  there  is  a  latent  cluster  variable  that 
influences  both  the  attribute  values  intrinsic  to  objects  and 
the  relationships  among  objects.  In  particular,  objects  are 
more  likely  to  link  to  other  objects  in  the  same  cluster  than 
objects  in  other  clusters,  and  pairs  of  objects  within  a  clus¬ 
ter  are  more  likely  to  have  similar  attribute  values  than  pairs 
spanning  different  clusters.  A  clustering  algorithm  that  ex¬ 
amines  both  link  structure  and  attributes  simultaneously 
should  be  more  robust  to  noise  than  methods  examining 
attribute  or  link  information  in  isolation. 

There  has  been  little  work  applying  spectral  techniques  to 
relational  domains  with  a  combination  of  link  and  attribute 
information. Existing techniques use either: (1) a complete
graph where attribute similarity is calculated for all n × n
pairs of objects (e.g., [16]), or (2) a nearest neighbor graph,
where attribute similarity is calculated for n × d pairs of
objects, with each object connected to a fixed number (d) of
other objects determined by spatial locality (e.g., [19]). Our
work  differs  in  that  we  are  trying  to  incorporate  the  hetero¬ 
geneous  relational  structure  into  the  similarity  metric. 

The  similarity  metric,  used  to  populate  the  similarity  ma¬ 
trix,  provides  a  means  to  extend  spectral  techniques  to  new 
domains.  However,  the  success  of  spectral  clustering  tech¬ 
niques  depends  heavily  on  the  choice  of  metric.  There  has 
been  some  research  into  learning  the  correct  similarity  func¬ 
tion  from  labeled  data  (e.g.,  [1]),  but  for  domains  where  the 
correct  clustering  is  unknown,  design  has  been  approached 
in  a  relatively  ad-hoc  manner.  This  leaves  us  with  little  guid¬ 
ance  as  to  how  to  incorporate  link  and  attribute  information 
into a metric for relational domains. This work investigates
the design of similarity metrics that incorporate multiple
sources of information and identifies the characteristics that
underlie successful metrics.

Specifically, we analyze the normalized cut (NCut) spectral
partitioning algorithm [19] from a statistical perspective.
For the special case of bi-partitioning, we show that as
cluster size $\to \infty$, the spectral decomposition will include an
eigenvector  that  is  piecewise  constant,  with  respect  to  the 
clusters,  for  any  similarity  metric  where  the  average  intra¬ 
cluster  similarity  differs  from  the  average  inter-cluster  sim¬ 
ilarity.  If  the  eigenvector  associated  with  the  2nd  smallest 
eigenvalue  of  the  similarity  matrix  is  piecewise  constant,  the 
spectral  partitioning  will  be  exact  [19].  Next,  we  empirically 
evaluate  the  effect  of  finite  cluster  sizes  using  synthetic  data. 
We  show  that:  (1)  decreasing  variance  of  cluster  similari¬ 
ties,  and  increasing  separation  of  similarities,  both  improve 
the  ordering  of  the  eigenvector  with  respect  to  the  clusters, 
and  (2)  increasing  the  separation  of  cluster  similarities  has 
a  greater  impact  on  algorithm  performance  when  the  NCut 
objective  function  is  used.  This  indicates  that  a  metric  that 
increases  variance  in  order  to  better  separate  the  cluster  sim¬ 
ilarities  will  perform  better  over  a  wider  range  of  conditions. 
Based  on  these  results,  we  propose  a  hybrid  similarity  metric 
for  relational  data  that  incorporates  link  and  attribute  infor¬ 
mation,  and  we  evaluate  performance  on  several  relational 
datasets.  We  show  that  resulting  clusters  exhibit  signifi¬ 
cant  community  structure  and  demonstrate  significant  per¬ 
formance  gains  when  using  the  resulting  clusters  in  a  related 
classification  task. 

2.  SPECTRAL  CLUSTERING 

Spectral  clustering  originated  with  graph  partitioning  tech¬ 
niques  that  exploit  the  connection  between  eigenvectors  and 
algebraic  properties  of  a  graph  (e.g.,  [6,  7]).  Recently,  Shi 
and  Malik  [19]  presented  a  new  clustering  algorithm  that 
uses  spectral  partitioning  to  optimize  the  NCut  objective 
function.  We  investigate  the  application  of  this  algorithm 
to  relational  domains  through  the  use  of  similarity  metrics 
that  incorporate  link  and  attribute  information. 

The NCut algorithm of [19] clusters datasets through eigenvalue
decomposition of a similarity matrix. The algorithm is a divisive,
hierarchical clustering algorithm, which takes a graph $G = (V, E)$,
a set of $k$ attributes $X = \{X^1, \ldots, X^k\}$, where
$X^k = \{x_i^k : v_i \in V\}$, and a similarity function $S$, where
$S(i,j)$ defines the similarity between $v_i, v_j \in V$, and
recursively partitions the graph as follows:

Let $W_{N \times N} = [S(i,j)]$ be the similarity matrix and let $D$
be an $N \times N$ diagonal matrix with $d_i = \sum_{j \in V} S(i,j)$.
Solve the eigensystem $(D - W)x = \lambda D x$ for the eigenvector
$x_1$ associated with the 2nd smallest eigenvalue $\lambda_1$.
Consider $m$ uniform values between the minimum and maximum value
in $x_1$. For each value $m$: bipartition the nodes into $(A, B)$
such that $A \cap B = \emptyset$, $A \cup B = V$, and
$\forall v_a \in A$, $x_{1a} < m$, and calculate the NCut value for
the partition,

$$NCut(A, B) = \frac{\sum_{i \in A, j \in B} S(i,j)}{\sum_{i \in A} d_i} + \frac{\sum_{i \in A, j \in B} S(i,j)}{\sum_{j \in B} d_j}.$$

Partition the graph into the $(A, B)$ with minimum NCut. If
stability$(A, B) < c$, recursively repartition $A$ and $B$.¹
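
A single bipartitioning step of the procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [19] or the code used for the experiments in this paper: the function names `ncut_value` and `ncut_bipartition`, the default `m = 8`, and the choice of thresholds strictly inside the range of the eigenvector are all made up for the example, and the stability test and the recursion are omitted. The sketch assumes a symmetric similarity matrix with strictly positive row sums, so that D is positive definite.

```python
import numpy as np
from scipy.linalg import eigh


def ncut_value(W, mask):
    """NCut(A, B) for the partition A = mask, B = ~mask."""
    d = W.sum(axis=1)                      # d_i = sum_j S(i, j)
    cut = W[np.ix_(mask, ~mask)].sum()     # similarity crossing the partition
    return cut / d[mask].sum() + cut / d[~mask].sum()


def ncut_bipartition(W, m=8):
    """One bipartition of a symmetric similarity matrix W.

    m is the number of candidate thresholds (an illustrative default).
    Returns a boolean mask for A and the NCut value of the best split.
    """
    d = W.sum(axis=1)
    D = np.diag(d)
    # Generalized symmetric eigenproblem (D - W) x = lambda D x.
    # eigh returns eigenvalues in ascending order, so column 1 holds
    # the eigenvector of the 2nd smallest eigenvalue.
    _, vecs = eigh(D - W, D)
    x1 = vecs[:, 1]

    best_mask, best_value = None, np.inf
    # m uniformly spaced thresholds strictly between min(x1) and max(x1).
    for t in np.linspace(x1.min(), x1.max(), m + 2)[1:-1]:
        mask = x1 < t
        if mask.all() or not mask.any():
            continue                       # skip degenerate splits
        value = ncut_value(W, mask)
        if value < best_value:
            best_mask, best_value = mask, value
    return best_mask, best_value
```

On a similarity matrix with two equal-sized blocks (for example, similarity 1.0 inside a block and 0.1 between blocks), the second eigenvector is piecewise constant, so every interior threshold yields the same split and the sketch should return the two blocks.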

¹ We use the stability threshold proposed in [19], where the
stability value is the ratio of the minimum and maximum bin sizes
after the values of $x_1$ are binned by value into $m$ bins. All the
experiments in this paper used the settings
$m = \lceil \log_2(N) + 1 \rceil$ and $c = 0.06$. Sensitivity analysis
on synthetic data shows $c = 0.06$ to be a conservative threshold,
returning clusters with high precision but low recall.

It takes $O(n^3)$ operations to solve for all eigenvalues of an
arbitrary eigensystem. However, $O(|E|)$ approximate algorithms
exist [10], and if the weight matrix is sparse, $O(n^{1.4})$
Lanczos algorithms can be used to compute the solution [18]. For
this reason, similarity metrics that produce sparse matrices are
preferable.

Our hybrid metrics calculate the similarity between objects $i$ and
$j$ through a weighted combination of attribute and link
information:

$$S(i,j) = \alpha \cdot \frac{1}{k} \sum_k s_k(i,j) + (1 - \alpha) \cdot l,$$

where $s_k(i,j) = 1$ if $x_i^k = x_j^k$ and 0 otherwise, and $l = 1$
if $e_{ij} \in E$ or $e_{ji} \in E$, and 0 otherwise.

When $\alpha = 1$, we refer to the metric as AttrOnly. When
$\alpha = 0$, we refer to the metric as LinkOnly. These metrics
are included as baselines: one for data clustering techniques
that ignore link information, and the other for graph partitioning
techniques that ignore attribute information. AttrOnly calculates
similarity by counting the number of attribute values objects $i$
and $j$ have in common (scaled by $k$ so the maximum similarity
is 1). LinkOnly uses the relational structure as a measure of
similarity.

When $\alpha = \frac{k}{k+1}$, we refer to the metric as LinkAsAttr.
This approach is an obvious way to include relational information:
links are incorporated as a match on the $(k+1)$th attribute.
With no prior domain knowledge, we have no reason to expect that
link structure contains more information than attribute values.
However, link structure is often central in relational domains.
For example, in a graph of hyperlinked web documents, we expect a
link to confer more information about topic clustering than a match
on a single word for two pages. To better exploit the relational
information, we set $\alpha = \frac{1}{2}$. This metric, referred to
as WtLinkAttr1, combines the link and attribute information
uniformly: high similarity indicates that two objects are related
or have a number of attribute values in common.

In sparse relational graphs, the expected intra-cluster link
similarity will be less than one, even if the links are perfectly
correlated with cluster membership. In this case, if the link and
attribute information are combined uniformly (e.g., WtLinkAttr1),
or if the attributes are given proportionally more weight (e.g.,
LinkAsAttr), noise in the attributes can drown out a strong link
signal. An approach that gives the link information proportionally
more weight (e.g., $\alpha < \frac{1}{2}$) may achieve better
performance. In practice we will not know how to scale the link
information to combine the two sources of information equally.
However, for the synthetic experiments discussed in the next
section, we know the maximum edge probability is 0.2, so setting
$\alpha = \frac{1}{6}$ (which weights the link term five times as
heavily as the attribute term) equalizes the attribute and link
signals. When $\alpha = \frac{1}{6}$, we refer to the metric as
WtLinkAttr2. Although we will not know the scaling factor in
practice, we include this metric to test the conjecture that the
poor performance of WtLinkAttr1 is due to the relatively weak link
signal being combined uniformly with the attribute signal.

When $\alpha = l$, we refer to the metric as LinkAsFilter. It
calculates similarity by weighting the existing edges of $G$ with
the  AttrOnly  metric.  Objects  that  are  not  directly  related 
have  a  similarity  of  0  regardless  of  their  attribute  values.  A 
high  similarity  score  indicates  that  two  objects  are  related 
and  have  a  number  of  attribute  values  in  common.  This  ap¬ 
proach  incorporates  both  sources  of  information  while  main¬ 
taining  the  sparsity  of  the  relational  data  graph  so  the  algo¬ 
rithm  can  use  efficient  eigensolver  techniques. 
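
The family of hybrid metrics can be written down compactly. The sketch below is illustrative only: the function names `hybrid_similarity` and `link_as_filter`, the dense NumPy representation, and the treatment of the diagonal (left as computed) are assumptions of the example rather than details taken from the experiments. It relies on the observation that, because the link indicator takes only the values 0 and 1, weighting by the link indicator itself reduces the hybrid form to the product of the link indicator and the attribute term, which is how LinkAsFilter is described above.

```python
import numpy as np


def hybrid_similarity(X, adj, alpha):
    """Weighted combination of attribute matches and link presence.

    X     : (n, k) array of categorical attribute values
    adj   : (n, n) 0/1 adjacency matrix (possibly directed)
    alpha : weight on the attribute term; (1 - alpha) goes to the link term
    """
    X = np.asarray(X)
    adj = np.asarray(adj)
    # attr[i, j] = fraction of the k attributes on which i and j agree.
    attr = (X[:, None, :] == X[None, :, :]).mean(axis=2)
    # link[i, j] = 1 if an edge exists in either direction, else 0.
    link = ((adj + adj.T) > 0).astype(float)
    return alpha * attr + (1.0 - alpha) * link


def link_as_filter(X, adj):
    """Attribute similarity kept only on existing edges, 0 elsewhere."""
    attr = hybrid_similarity(X, adj, alpha=1.0)   # attribute term alone
    link = hybrid_similarity(X, adj, alpha=0.0)   # link term alone
    return link * attr
```

With `k = X.shape[1]`, calling `hybrid_similarity` with `alpha=1.0`, `alpha=0.0`, and `alpha=k / (k + 1)` gives the attribute-only, link-only, and link-as-extra-attribute variants. Only `link_as_filter` preserves the sparsity pattern of the graph, since every other setting with a nonzero attribute weight generally produces a dense matrix.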

3.  ALGORITHM  ANALYSIS 

The  recursive  nature  of  the  algorithm  complicates  analy¬ 
sis  of  higher-order  partitioning,  so  we  restrict  our  attention 
to  the  (simpler)  case  of  a  single  bipartitioning  of  the  graph. 
Finding an optimal partition, which minimizes the NCut criterion,
is an NP-hard problem [19]. However, [19] shows that when the
eigenvector $x_1$ associated with the 2nd smallest eigenvalue of
the eigensystem $(D - W)x = \lambda D x$ is piecewise constant with
respect to a partition $(A, B)$ of $V$, that is, $x_{1i} = \alpha$
for $i \in A$ and $x_{1i} = \beta$ for $i \in B$, with
$\beta \neq \alpha$, then $(A, B)$ is the optimal partition: it
minimizes the NCut criterion and $\lambda_1 = NCut(A, B)$.

Recent analysis has focused on achieving a more thorough
understanding of the conditions under which $x_1$ will be piecewise
constant. Meila and Shi [13] outline a set of conditions under
which the spectral algorithm will return an exact partitioning,
showing that the spectral problem formulated for NCut is equivalent
to finding the eigenvectors and eigenvalues of the stochastic
matrix $P = D^{-1}W$. The authors connect spectral clustering to
Markov random walks, showing that $P$ will have an eigenvector that
is piecewise constant w.r.t. a partition $(A_1, A_2)$ iff $P$ is
block-stochastic w.r.t. $(A_1, A_2)$. Here, block-stochastic means
that the underlying Markov random walk can be viewed as a Markov
chain with state space $A = (A_1, A_2)$ and transition probability
matrix $R = [P_{ss'}]_{s,s'=1,2}$, where for $s, s' = 1, 2$,
$\sum_{j \in A_{s'}} P_{ij}$ is constant $\forall i \in A_s$, and
$P_{ss'} = \sum_{j \in A_{s'}} P_{ij}$ for any $i \in A_s$. This
shows that spectral clustering groups nodes based on the similarity
of their transition probabilities to subsets of the graph.
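
The block-stochastic condition is easy to check numerically. The following sketch is an illustration only, with hypothetical helper names (`block_sums`, `is_block_stochastic`, `second_eigenvector`): it sums each row of the transition matrix over the nodes of each block, tests whether those sums are constant within a block, and extracts the eigenvector belonging to the second largest eigenvalue so that its piecewise constancy can be inspected.

```python
import numpy as np


def block_sums(P, labels):
    """Column s holds, for every node i, the total transition
    probability from i into block s."""
    labels = np.asarray(labels)
    blocks = np.unique(labels)
    return np.stack([P[:, labels == s].sum(axis=1) for s in blocks], axis=1)


def is_block_stochastic(P, labels, tol=1e-9):
    """True if the block sums are constant across the nodes of each block."""
    labels = np.asarray(labels)
    sums = block_sums(P, labels)
    for s in np.unique(labels):
        rows = sums[labels == s]
        if (rows.max(axis=0) - rows.min(axis=0)).max() > tol:
            return False
    return True


def second_eigenvector(P):
    """Right eigenvector of P for its second largest eigenvalue."""
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    return vecs[:, order[1]].real
```

For instance, the symmetric weights `[[0, 2, 1, 0], [2, 0, 0, 1], [1, 0, 0, 3], [0, 1, 3, 0]]` with blocks `[0, 0, 1, 1]` are far from constant within blocks, yet after row normalization every node in the first block sends the same total probability to each block, and likewise for the second block. The matrix is therefore block-stochastic and its second eigenvector takes one value on each block.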

There  has  been  little  analysis  of  the  impact  of  non-constant 
transition  probabilities  on  algorithm  performance.  Empir¬ 
ical  evidence  indicates  that  the  algorithm  finds  good  par¬ 
titions  even  when  the  transition  probabilities  are  far  from 
constant.  Ideally,  we  would  like  to  characterize  the  condi¬ 
tions  necessary  for  optimal  performance  and  bound  algo¬ 
rithm  performance  otherwise.  As  a  first  step,  we  analyze 
asymptotic  performance  for  non-constant  intra-  and  inter¬ 
cluster  transition  probabilities. 

If we assume a generative model of the data where a latent
cluster variable $(A_1, A_2)$ determines the attribute values
intrinsic to the objects and the relationships among objects, we
can analyze the similarity metric $S(i,j)$, and each entry in $W$,
as a random variable. Consider the entries of row $i$. The entries
$W_{ij}$, $W_{ik}$ are not independent because the similarity
values are both based on node $i$. However, conditioned on the
state of $i$ (e.g., attribute values of $i$), the entries are
independent random variables since the state of $j$ is independent
of the state of $k$. As a result, the entries of row $i$ can be
viewed as independent random variables. With this model we can
show that any similarity metric whose average intra-cluster and
inter-cluster values differ will produce piecewise constant
eigenvectors in the limit.


Theorem: Let $A = (A_1, A_2)$ be a partition of $V$. Let
the function $S(i,j)$ define the similarity measure between
$v_i, v_j \in V$. If, $\forall i, j, k$, $S(i,j)$ is conditionally
independent of $S(i,k)$ given node $i$, and
$E[P_{11}]E[P_{22}] \neq E[P_{12}]E[P_{21}]$, then $P$ has an
eigenvector that will converge to piecewise constant w.r.t. $A$
as $|A_1|, |A_2| \to \infty$.

We provide the intuition for the proof here and refer the
reader to Appendix A for details. If we view the entries of
$W$ as random variables, the normalized values in $P$ are also
random variables (i.e., the entries in $W$ divided by a row
sum of random variables). The total intra- and inter-cluster
transition probabilities in $P$ (e.g., $\sum_{j \in A_1} P_{ij}$)
then correspond to the ratio of two sums of random variables.
Since the transition probabilities are composed of sums of
independent random variables, as cluster size $\to \infty$, the
intra- and inter-cluster transition probabilities will converge to
the same value for all nodes in each cluster. Therefore an
eigenvector of $P$ will converge to piecewise constant w.r.t.
$(A_1, A_2)$, provided the intra- and inter-cluster means (e.g.,
$E[P_{11}]$, $E[P_{12}]$) are distinguishable.

This analysis indicates that all metrics satisfying the conditions
of the theorem will perform equally in the limit. We expect,
however, that finite sample performance will vary based on the
characteristics of the metrics. In particular, we expect that
performance will be influenced by the mean and variance of the
intra- and inter-cluster transition probabilities. We demonstrate
the impact of the transition probability distributions below,
using synthetic data experiments.

4.  SYNTHETIC  DATA  EXPERIMENTS 

In  order  to  identify  the  situations  where  we  can  expect  each 
of  the  similarity  metrics  to  perform  well,  we  evaluate  al¬ 
gorithm  performance  on  synthetic  data  sets  for  which  the 
correct  clustering  is  known.  This  facilitates  analysis  over  a 
wide  range  of  conditions. 

4.1  Synthetic  Data 

Our synthetic data sets are undirected, connected graphs
($G = (V, E)$) where nodes correspond to objects and edges
correspond to relations among objects. Unless otherwise
indicated, $|V| = 200$. A binary label, $C = \{+, -\}$, is used
to represent cluster membership; labels are assigned randomly
to each object with $P(+) = 0.5$. Each object has five binary
attributes, where the attribute values are assigned randomly
given the object's cluster label. Edges are added to the graph
by considering each pair of objects in $V$ independently, and
adding edges randomly given the cluster labels of the two objects.

The experiments record algorithm performance while varying
both attribute and link association. Within each level of
correlation, all five attributes were generated with the same
probability: $P_+ = P(A = 1 | C = +) \in \{0.50, 0.55, \ldots, 0.95, 1.0\}$,
$P_- = P(A = 1 | C = -) = 1.0 - P_+$. The symmetry in attribute
parameters simplifies the analysis but it is not necessary for
algorithm correctness. Intra-cluster and inter-cluster links were
generated with the following range of probabilities:
$P^l_{in} = P(e_{ij} | C_i = C_j) \in \{0.10, 0.12, \ldots, 0.18, 0.20\}$,
$P^l_{out} = P(e_{ij} | C_i \neq C_j) = 0.2 - P^l_{in}$. Here the
range of probabilities, and symmetry, was chosen to produce a
graph with approximately 10% of the $n(n - 1)/2$ possible edges.
This level of linkage is comparable to the levels of sparsity we
have observed in real-world relational data sets.
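
The generation procedure can be sketched in a few lines. This is a hedged illustration rather than the generator used for the reported experiments: the function name `generate_synthetic`, its parameter names, and the default values are assumptions of the example, and unlike the data sets described above the sketch does not check or enforce that the resulting graph is connected.

```python
import numpy as np


def generate_synthetic(n=200, n_attrs=5, p_plus=0.8, p_in=0.16, seed=0):
    """Draw labels, binary attributes, and an undirected graph.

    p_plus : P(attribute = 1 | label = +); the '-' cluster uses 1 - p_plus
    p_in   : intra-cluster edge probability; inter-cluster is 0.2 - p_in
    """
    rng = np.random.default_rng(seed)
    labels = rng.random(n) < 0.5                    # True means '+'

    # Attribute values are drawn independently given the cluster label.
    p_attr = np.where(labels, p_plus, 1.0 - p_plus)
    X = (rng.random((n, n_attrs)) < p_attr[:, None]).astype(int)

    # Each unordered pair is considered once (strict upper triangle),
    # with an edge probability that depends on the two labels.
    same = labels[:, None] == labels[None, :]
    p_edge = np.where(same, p_in, 0.2 - p_in)
    upper = np.triu(rng.random((n, n)) < p_edge, k=1)
    adj = (upper | upper.T).astype(int)             # symmetric, no self loops
    return labels, X, adj
```

Because the two edge probabilities sum to 0.2 and the clusters are roughly balanced, the expected density is close to 10% for any admissible `p_in`, which matches the sparsity level described above.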

4.2  Metric  Performance 

We  measured  the  accuracy  of  the  six  metrics  across  the  range 
of  attribute  and  link  probabilities  described  above.  Figure  1 
reports  the  accuracy  of  the  clusterings  returned  by  the  simi¬ 
larity  metrics,  averaged  over  100  trials  at  each  setting.  Note 
that  the  bottom,  foremost  corner  of  each  plot  represents 
completely  random  link  and  attribute  information,  where 
no  metric  should  do  better  than  0.5. 

LinkOnly  and  AttrOnly  performance  is  as  expected — they 
perform  well  when  the  link,  or  respectively  attribute,  signal 
is moderate to high, but poorly otherwise. The LinkAsAttr
and WtLinkAttr1 results are comparable to AttrOnly. However,
the LinkAsFilter and WtLinkAttr2 metrics achieve
perfect accuracy over a wide range of conditions, with
LinkAsFilter covering more space than WtLinkAttr2. These
metrics should yield good results in datasets where either the
links  or  the  attributes  are  moderately  correlated  with  the 
clusters.  However,  they  do  not  always  perform  as  well  as 
LinkOnly  and  AttrOnly.  Consider  the  LinkOnly  results  when 
link  correlation  is  moderate  and  attribute  correlation  is  low — 
both  hybrid  metrics  achieve  significantly  lower  accuracy  than 
would  be  achieved  considering  links  in  isolation.  Similar  be¬ 
havior  is  apparent  for  the  AttrOnly  metric,  but  notice  that 
the  effect  is  more  pronounced  in  this  situation.  This  indi¬ 
cates  that  the  two  metrics  rely  more  heavily  on  link  infor¬ 
mation  and  illustrates  the  tradeoff  for  utilizing  both  sources 
of  information — the  additional  information  increases  vari¬ 
ance,  which  will  impair  performance  in  some  situations,  in 
exchange  for  better  coverage  of  the  space. 

4.3  Performance  Analysis 

LinkAsFilter  and  WtLinkAttr2  achieve  superior  performance 
over  a  wide  range  of  data  characteristics,  but  what  is  the 
mechanism  by  which  this  occurs?  Following  our  analysis 
in  section  3,  we  hypothesize  that  metric  performance  is  in¬ 
fluenced  by  intra-  and  inter-cluster  transition  probabilities. 
We  conjecture  that  the  algorithm  will  be  able  to  distinguish 
clusters,  if  the  distributions  of  intra-  and  inter-cluster  tran¬ 
sition  probabilities  are  separable,  where  separation  depends 
on  the  mean  and  variance  of  the  transition  probabilities. 

Given our data generation parameters, we can calculate
intra- and inter-cluster mean transition probabilities
analytically. Recall that our data generation process produces
the same distribution for each cluster, and furthermore, we
know that the transition probabilities in $P$ are normalized
to sum to one. This means a single pair of means,
$\mu_{P_{in}} = E[P_{in}]$ and $\mu_{P_{out}} = 1 - \mu_{P_{in}}$,
describes both clusters, so we only need to examine
$\mu_{P_{in}}$. When $\mu_{P_{in}} = 1.0$ there is maximal
separation between the two clusters; $\mu_{P_{in}} = 0.5$
corresponds to no separation.
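
As a rough illustration of how the intra-cluster mean responds to the data parameters, the sketch below computes a first-order approximation: the expected intra-cluster similarity divided by the sum of the expected intra- and inter-cluster similarities. This ratio-of-expectations shortcut, the assumption of equal-sized clusters, and the function name `approx_mean_intra` are assumptions of the example and are not necessarily the calculation behind Figure 2. It uses the fact that two objects in the same cluster match on a binary attribute with probability p² + (1 − p)², while objects in different clusters match with probability 2p(1 − p).

```python
def approx_mean_intra(p_plus, p_in, alpha=None):
    """Approximate intra-cluster mean transition probability.

    p_plus : P(attribute = 1 | label = +)
    p_in   : intra-cluster edge probability (inter-cluster is 0.2 - p_in)
    alpha  : attribute weight of the hybrid metric;
             None selects the link-as-filter product form.
    """
    q = 1.0 - p_plus
    attr_in = p_plus ** 2 + q ** 2      # match probability, same cluster
    attr_out = 2.0 * p_plus * q         # match probability, different clusters
    link_in, link_out = p_in, 0.2 - p_in

    if alpha is None:
        # Links and attributes are independent given the labels, so the
        # expected product is the product of the expectations.
        s_in, s_out = link_in * attr_in, link_out * attr_out
    else:
        s_in = alpha * attr_in + (1.0 - alpha) * link_in
        s_out = alpha * attr_out + (1.0 - alpha) * link_out
    return s_in / (s_in + s_out)
```

Under this approximation the product form multiplies the two odds ratios together, so whenever both sources carry some signal it yields a larger intra-cluster mean than either source alone.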

Figure 2 graphs $\mu_{P_{in}}$ vs. attribute/link correlations. The
shapes of the graphs are quite similar to the accuracy graphs
in figure 1, indicating a strong relationship between mean
separation and algorithm performance. However, the areas
where we observe perfect performance (i.e., accuracy = 1.0)
do not necessarily correspond to maximum mean separation
(i.e., they include settings with $\mu_{P_{in}} < 1.0$). This
illustrates a difference between the LinkAsFilter and WtLinkAttr2
metrics: $\mu_{P_{in}}$ is significantly higher on average for the
LinkAsFilter metric.

To examine the effect of $\mu_{P_{in}}$ on algorithm performance,
we analyzed the data from all metrics concurrently. Figure
3a graphs $\mu_{P_{in}}$ vs. accuracy for the experiments reported
above, combining results from all the metrics in the same
graph. There is a clear relationship between $\mu_{P_{in}}$ and
accuracy (corr = 0.849, $p \ll 0.05$): accuracy is consistently
high for $\mu_{P_{in}} > 0.675$ and consistently low otherwise. We
looked at the association between $\mu_{P_{in}}$ and the eigenvector
values in $x_1$ using a number of different measures of eigenvector
stability. Only one measure showed a clear relationship
to $\mu_{P_{in}}$: a measure of the quality of the ordering in the
(sorted) eigenvector, which looked at the sorted eigenvector
and recorded the maximum accuracy possible from the set
of $m$ possible partition values considered by the algorithm.
The linear search for an optimal partition (in the NCut
algorithm) should not be adversely affected by degradation of
piecewise constancy unless the degradation also affects the
ordering of objects' eigenvector values. If the maximum
accuracy is low, this indicates disorder in the eigenvector. The
evector ordering measure is graphed against $\mu_{P_{in}}$ in figure
3b. It shows that decreasing $\mu_{P_{in}}$ results in a disordering
of the eigenvector values. These results explain the high
accuracy results: for $\mu_{P_{in}} > 0.675$ there is little
disorder in the eigenvector.
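
One way to compute an ordering measure of this kind is sketched below. The function name `evector_ordering`, the placement of the thresholds (evenly spaced over the full range of the eigenvector, endpoints included), and the handling of label symmetry are assumptions made for illustration and may differ from the measure used to produce Figure 3.

```python
import numpy as np


def evector_ordering(x1, labels, m):
    """Best accuracy attainable by thresholding the eigenvector x1.

    x1     : eigenvector values, one per object
    labels : true binary cluster labels
    m      : number of evenly spaced candidate thresholds
    """
    x1 = np.asarray(x1, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best = 0.0
    for t in np.linspace(x1.min(), x1.max(), m):
        predicted = x1 < t
        acc = np.mean(predicted == labels)
        # The sign of an eigenvector is arbitrary, so a split is scored
        # under both possible assignments of sides to cluster labels.
        best = max(best, acc, 1.0 - acc)
    return best
```

A value of 1.0 means some threshold separates the two clusters perfectly, so any loss of accuracy must come from the choice of threshold rather than from the eigenvector itself. Values near 0.5 indicate that the clusters are interleaved in the sorted eigenvector.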

Figure  3c  graphs  evector  ordering  vs.  accuracy.  There  is 
a  strong  correlation  between  evector  ordering  and  accuracy, 
but  there  are  also  a  significant  number  of  trials  with  very 
little  disorder  that  achieve  only  low  accuracy.  This  effect 
is  explained  by  figure  3d,  where  we  graph  the  precision  of 
the  smallest  cluster  returned  by  the  algorithm.  This  shows 
that  when  the  eigenvector  is  ordered  correctly  but  the  al¬ 
gorithm  only  achieves  low  accuracy,  it  is  because  the  algo¬ 
rithm  prefers  to  separate  a  small,  but  pure,  cluster  from  the 
rest  of  the  graph.  Why  does  the  algorithm  break  off  small, 
high-precision  clusters  even  when  the  eigenvector  ordering 
is  correct?  This  is  not  a  spurious  effect  due  to  consideration 
of  only  a  small  number  of  thresholds  (e.g.,  m  values).  It 
remains  consistent  even  when  we  set  m  =  N.  We  discuss 
reasons  for  this  effect  below. 

We have shown that mean separation affects algorithm
performance through the ordering of the objects' eigenvector
values, but how does variance interact with mean separation
to degrade performance? Figures 4a-b graph the same variables
as figure 3a, but for a set of experiments with $|V| = 500$ and
$|V| = 50$. This illustrates the impact of decreased, and
increased, variance in the transition probabilities: increasing
variance impairs performance for all $\mu_{P_{in}}$, but decreasing
variance only improves performance for $\mu_{P_{in}} > 0.675$. This
is contrary to our expectation that decreased variance would
improve performance by increasing the separation between cluster
transition probabilities. However, this effect is due to the NCut
optimization, not the ordering of the eigenvector values. Figure 4c
shows a box plot of evector ordering as a function of sample size,
for the set of trials with $\mu_{P_{in}} < 0.675$. Except for the
smallest sample size, where we see higher accuracy due to chance
alone, the mean ordering value is monotonically increasing
with sample size. Figure 4d graphs accuracy results for the
same sample, showing that the algorithm converges to low
accuracies as sample size increases. Minimizing the NCut
criterion causes the algorithm to consistently prefer high
precision over high accuracy when the separation between
intra- and inter-cluster transition probabilities is low (i.e.,
$\mu_{P_{in}} < 0.675$). This indicates that metrics with low
$\mu_{P_{in}}$ should not be combined with the NCut criterion.

Figure 1: Cluster accuracy of metrics on synthetic data: (a) AttrOnly, (b) LinkOnly, (c) LinkAsAttr, (d)
WtLinkAttr1, (e) WtLinkAttr2, and (f) LinkAsFilter.

Figure 2: Intra-cluster means of metrics for synthetic data: (a) AttrOnly, (b) LinkOnly, (c) LinkAsAttr, (d)
WtLinkAttr1, (e) WtLinkAttr2, and (f) LinkAsFilter.

Figure 3: Analysis of intra-cluster mean on algorithm
performance: (a) 200 objects, (b) $\mu_{P_{in}}$ vs. ordering,
(c) ordering vs. accuracy and (d) precision.

Figure 4: Analysis of intra-cluster variance on algorithm
performance: (a) 500 objects, (b) 50 objects, (c) ordering and
(d) accuracy for settings with $\mu_{P_{in}} < 0.675$.

It  is  now  clear  that  the  WtLinkAttr2  and  LinkAsFilter  met¬ 
rics  achieve  their  good  performance  due  to  high  y,pin,  but 
what do they trade off for this increased separation? Figure 5a
graphs a box plot of $\mu_{P_{in}}$ for each metric individually.
This is a one-dimensional summary of the data in figure 2, which
again illustrates that $\mu_{P_{in}}$ is significantly higher for
the LinkAsFilter metric on average. Figure 5b graphs a box plot of
the variance of $P_{in}$ for each metric. This shows that
LinkAsFilter trades off higher variance for increased mean
separation. Figure 5c-d graphs the performance of WtLinkAttr2 and
LinkAsFilter for $|V| = 50$. Compare this to figure 1 to see that
performance degradation is not uniform across metrics. The
LinkAsFilter metric is adversely affected over a wider range of
data conditions. This illustrates the primary distinction between
LinkAsFilter and WtLinkAttr2. The LinkAsFilter metric reduces the
amount of information it uses in order to increase the mean
separation between the clusters. Because it filters the attribute
information through the existing edges of the graph, it throws away
both useful and noisy data and increases the variance of the
transition probabilities. If the sample size is large enough to
withstand this increase in variance, then the metric will produce
superior clusterings. However, when the sample size is small, the
filter can do more harm than good. For example, filtering through
the existing edges may disconnect a previously connected cluster.
In these situations, it may be best to use the WtLinkAttr2 metric,
which suffers less from increased variance and still performs well
over a wide range of data characteristics. However, since we do not
know how to set $\alpha$ for WtLinkAttr2 in practice, and because
LinkAsFilter offers the opportunity to use efficient eigensolver
techniques, we focus on LinkAsFilter for our empirical data
experiments.
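
The disconnection effect mentioned above can be illustrated with a small sketch. It assumes, for illustration only, that filtering keeps the attribute similarity of a pair when an edge exists and sets it to zero otherwise; the function names, the toy graph, and the similarity values are all hypothetical.

```python
def link_as_filter(adjacency, attr_sim):
    """Keep attribute similarity only where an edge exists (assumed form)."""
    n = len(adjacency)
    return [[attr_sim[i][j] if adjacency[i][j] else 0.0 for j in range(n)]
            for i in range(n)]


def count_components(weights):
    """Count connected components, treating positive weights as edges."""
    n = len(weights)
    seen = set()
    components = 0
    for start in range(n):
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            for v in range(n):
                if weights[u][v] > 0 and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return components


# Hypothetical 4-node path graph 0-1-2-3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
# Hypothetical attribute similarities; nodes 1 and 2 share no attribute values.
attr_sim = [
    [1.0, 0.5, 0.5, 0.0],
    [0.5, 1.0, 0.0, 0.5],
    [0.5, 0.0, 1.0, 0.5],
    [0.0, 0.5, 0.5, 1.0],
]
filtered = link_as_filter(adjacency, attr_sim)
print(count_components(adjacency), count_components(filtered))
```

In this toy case the link graph is one component, but the filtered similarity graph splits in two because the only edge joining the halves carries zero attribute similarity.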

5.  EMPIRICAL  DATA  EXPERIMENTS 

The  experiments  reported  below  are  intended  to  evaluate 
two  assertions.  The  first  claim  is  that  the  LinkAsFilter  clus¬ 
tering  approach  can  be  used  to  find  groups  of  items  with 
similar  attribute  values  and  high  inter-connectedness.  We 
evaluate  this  claim  by  comparing  the  clusters  produced  by 
the  LinkAsFilter  metric  to  randomly  generated  clusters  of 
the  same  size,  evaluating  intra-cluster  attribute  similarity 
and  intra-cluster  linkage. 

The  second  claim  is  that  the  LinkAsFilter  clustering  ap¬ 
proach  finds  meaningful  clusters.  Evaluating  clusterings  of 
datasets  for  which  there  is  no  right  answer  is  a  difficult  task. 
One  approach  is  to  present  the  resulting  clusters  for  user  ex¬ 
amination.  For  this  type  of  subjective  evaluation,  we  include 
example  cluster  members  from  two  real-world  datasets.  An¬ 
other,  more  objective,  approach  is  to  examine  cluster  utility 
by evaluating the cluster labels' ability to improve a related
classification task. We evaluate three approaches (LinkOnly,
AttrOnly,  and  LinkAsFilter)  on  a  third  real-world  dataset 
in  this  manner,  and  show  the  LinkAsFilter  clusters  achieve 
a  significant  improvement  in  classification  accuracy. 

5.1  Datasets 

We clustered three real-world datasets where attributes exhibit
correlation among linked objects, and the link structure exhibits
clustering. These are the characteristics we expect to find in
datasets that contain communities, and it is in these situations
that we expect our clustering algorithm will perform well.

Figure 5: (a) Intra-cluster mean by metric, (b) intra-cluster
variance by metric, (c) accuracy of WtLinkAttr2 and (d)
LinkAsFilter for 50 objects.

The  first  data  set  is  drawn  from  Cora,  a  database  of  com¬ 
puter  science  research  papers  extracted  automatically  from 
the  web  using  machine  learning  techniques  [12].  We  selected 
the largest connected component from the set of machine learning
papers published after 1993. The resulting graph contains 1,042
papers and 2,546 citation links. We clustered the undirected
version of this graph. The similarity
metric  considered  two  topic  attributes  at  different  levels  of 
granularity  (e.g.,  {Machine  Learning,  Neural  Networks}  and 
{Planning,  Rule  Learning}). 
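
The preprocessing step described above (treat links as undirected, keep the largest connected component) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the toy edge list are hypothetical.

```python
from collections import defaultdict, deque


def largest_connected_component(edges):
    """Return the node set of the largest component of an undirected graph."""
    neighbors = defaultdict(set)
    for u, v in edges:
        # Treat every link as undirected.
        neighbors[u].add(v)
        neighbors[v].add(u)
    best = set()
    seen = set()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical citation pairs (citing paper, cited paper).
edges = [("a", "b"), ("c", "b"), ("d", "e")]
component = largest_connected_component(edges)
kept_edges = [(u, v) for u, v in edges if u in component and v in component]
print(sorted(component), kept_edges)
```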

The  second  data  set  consists  of  a  set  of  web  pages  from 
four  computer  science  departments,  collected  by  the  WebKB 
Project  [4].  The  web  pages  have  been  manually  classified 
into  the  categories:  course,  faculty,  staff,  student,  research 
project,  or  other.  The  category  “other”  denotes  a  page 
that  is  not  a  home  page  (e.g.,  a  curriculum  vitae  linked 
from  a  faculty  page  or  homework  description  linked  from  a 
course  page).  The  collection  contains  approximately  4,000 
web  pages  and  8,000  hyperlinks  among  those  pages.  We 
clustered the largest connected component in these data, a
graph of 1,236 pages and 3,673 hyperlinks. Again, we used
the  undirected  version  of  the  graph.  The  similarity  metric 
considered  two  attributes:  page  category  and  department. 
However,  the  entire  component  is  from  a  single  department 
(Wisconsin)  so  the  department  attribute  adds  no  additional 
information. 

The third data set is a relational data set containing information
about the yeast genome at the gene and the protein level
(www.cs.wisc.edu/~dpage/kddcup2001/). The data set contains
information about 1,243 genes and 1,734 interactions. We clustered
the largest connected component, which consisted of 814 genes and
1,475 interactions. The similarity metric considered 13 boolean
function attributes. Each gene may have multiple functions. We
evaluated the resulting cluster labels' ability to predict gene
localization. We applied a relational Bayesian classifier [15] to
the entire dataset, using the cluster labels as an additional
attribute, and measured performance.

Table 1: Cora cluster examples

Cluster 9: Belief revision: A critique; Plausibility measures and
default reasoning; Modeling belief in dynamic systems. Part I:
foundations; Knowledge-Based Framework for Belief Change, Part II:
Revision and Update; Iterated revision and minimal revision of
conditional beliefs; An event-based abductive model of update; On
the logic of iterated belief revision; A unified model of
qualitative belief change: A dynamical systems perspective;
Generalized update: Belief change in dynamic settings

Cluster 14: In defense of C4.5: Notes on learning one-level
decision trees; Exploring the decision forest: An empirical
investigation of Occams razor in decision tree induction;
Algorithmic stability and sanity-check bounds for leave-one-out
cross-validation; Bias and the quantification of stability;
Characterizing the generalization performance of model selection
strategies; A new metric-based approach to model selection;
Preventing overfitting of Cross-Validation data; Further
experimental evidence against the utility of occams razor

Cluster 19: An empirical evaluation of bagging and boosting;
On-line portfolio selection using multiplicative updates;
Heterogeneous uncertainty sampling for supervised learning;
Improved boosting algorithms using confidence-rated predictions;
On-line algorithms in machine learning; Training algorithms for
hidden Markov models using entropy based distance functions; A
system for multiclass multi-label text categorization;
Coevolutionary Search Among Adversaries

Cluster 24: Refinement of Bayesian networks by combining
connectionist and symbolic techniques; DistAl: An inter-pattern
distance-based constructive learning algorithm; An Anytime Approach
to Connectionist Theory Refinement: Refining the Topologies of
Knowledge-Based Neural Networks; Creating advice-taking
reinforcement learners; Learning controllers for industrial robots;
Generating accurate and diverse members of a neural-network
ensemble; A Neural Architecture for a High-Speed Database Query
System; Comparing methods for refining certainty-factor rule-bases

5.2  Results 

Clustering  the  sample  of  Cora  papers  produced  71  clusters 
varying  in  size  from  1-202  papers,  with  an  average  size  of 
15.  We  report  statistics  for  the  28  clusters  with  more  than 
six  papers.  Table  1  includes  randomly  selected  titles  from 
four  clusters  for  subjective  evaluation.  Although  we  did  not 
use  title  words  in  the  similarity  metrics,  the  clusters  show  a 
surprising  uniformity  among  the  titles.  This  indicates  that 
research  papers  can  be  clustered  into  meaningful  groups  us¬ 
ing  the  citation  structure  and  topic  attributes  alone. 

Figure 6: Evaluation of hybrid clusters in Cora.

Table 2: WebKB cluster examples

Cluster 5:
http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-89-890;
http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-90-947;
http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-95-1283;
http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-91-1037;
http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-90-962;
http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-89-900;
http://www.cs.wisc.edu/~reps/reps.html;
http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-91-1038

Cluster 9:
http://www.cs.wisc.edu/~bart/537/quizzes/quiz6.html;
http://www.cs.wisc.edu/~bart/cs537.html;
http://www.cs.wisc.edu/~bart/537/quizzes/quiz3.html;
http://www.cs.wisc.edu/~bart/537/quizzes/quiz10.html;
http://www.cs.wisc.edu/~bart/537/quizzes/quiz2.html;
http://www.cs.wisc.edu/~bart/537/programs/program2.html;
http://www.cs.wisc.edu/~bart/537/lecturenotes/titlepage.html;
http://www.cs.wisc.edu/~bart/537/quizzes/quiz9.html

Cluster 11:
http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/numbers.html;
http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/data.structures.html;
http://www.cs.wisc.edu/~cs354-2/cs354/solutions/Q2.j.html;
http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/arch.features.html;
http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/interrupts.html;
http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/case.studies.html;
http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/arith.int.html;
http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/MAL.html

Cluster 14:
http://www.cs.wisc.edu/condor/research.html;
http://www.cs.wisc.edu/~bart/cs638.html;
http://www.cs.wisc.edu/coral/coral.people.html;
http://www.cs.wisc.edu/~brad/brad.html;
http://www.cs.wisc.edu/~sastry/spring96.html;
http://www.cs.wisc.edu/~ashraf/ashraf.html;
http://maf.wisc.edu/distributed/condor/index.html;
http://www.cs.wisc.edu/~ssl/resume.html

Figure 7: Evaluation of hybrid clusters in WebKB.

To evaluate intra-cluster attribute similarity, we averaged the
attribute similarity across all pairs of papers within each
cluster. As a baseline measure we calculated the average attribute
similarity in ten random clusterings. Figure 6a plots the
intra-cluster attribute similarity (dark bars) compared to the
expected averages given random clusterings (light bars), with the
clusters listed in ascending order by size. Attribute similarity is
significantly higher than expected.² Note that the largest cluster
(#28) does not exhibit high linkage or attribute similarity. This
cluster may contain the set of papers that could not be partitioned
into smaller clusters (i.e., the papers with no coherent community
structure).
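
The attribute-similarity comparison against random clusterings can be sketched roughly as follows. The similarity function (fraction of matching attribute values), the helper names, and the toy data are assumptions made for illustration; the paper's own similarity metric is defined elsewhere.

```python
import random
from itertools import combinations


def attr_similarity(a, b):
    """Fraction of attribute positions on which two items agree (assumed)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_similarity(members, attrs):
    """Average similarity over all pairs; assumes at least two members."""
    pairs = list(combinations(members, 2))
    return sum(attr_similarity(attrs[i], attrs[j]) for i, j in pairs) / len(pairs)


def random_baseline(cluster_sizes, attrs, trials=10, seed=0):
    """Average within-cluster similarity for random clusters of equal sizes."""
    rng = random.Random(seed)
    items = list(attrs)
    totals = [0.0] * len(cluster_sizes)
    for _ in range(trials):
        rng.shuffle(items)
        start = 0
        for c, size in enumerate(cluster_sizes):
            group = items[start:start + size]
            totals[c] += mean_pairwise_similarity(group, attrs)
            start += size
    return [t / trials for t in totals]


# Hypothetical items with two topic attributes each.
attrs = {
    0: ("ML", "NN"), 1: ("ML", "NN"), 2: ("ML", "NN"),
    3: ("Planning", "Rule"), 4: ("Planning", "Rule"), 5: ("Planning", "Rule"),
}
clusters = [[0, 1, 2], [3, 4, 5]]
observed = [mean_pairwise_similarity(c, attrs) for c in clusters]
baseline = random_baseline([len(c) for c in clusters], attrs, trials=10)
print(observed, baseline)
```

The observed values play the role of the dark bars and the baseline values the light bars; a significance test across the random trials would be a separate step.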

Figure  6b  shows  the  actual  and  expected  proportion  of  intra¬ 
cluster  citations.  To  assess  the  connectivity  of  the  clusters, 
we  compared  the  proportion  of  intra-cluster  linkage  (per 
cluster)  to  expected  proportions,  given  ten  random  clus¬ 
terings.  Again,  the  proportion  of  intra-cluster  citations  is 
significantly  higher  than  the  expected  values.  This  indi¬ 
cates  that  the  clustering  technique  is  finding  groups  of  highly 
inter-connected  research  papers. 
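
A per-cluster linkage proportion of this kind can be computed as in the sketch below. It assumes the proportion is the number of links with both endpoints in the cluster divided by the number of links touching the cluster; that definition, the function name, and the toy graph are illustrative assumptions.

```python
def intra_cluster_link_proportion(edges, labels):
    """Per cluster: links with both ends inside / links touching the cluster."""
    inside = {}
    touching = {}
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        # A link touches each distinct cluster that one of its ends belongs to.
        for c in {cu, cv}:
            touching[c] = touching.get(c, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return {c: inside.get(c, 0) / touching[c] for c in touching}


# Hypothetical undirected links and cluster labels.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
proportions = intra_cluster_link_proportion(edges, labels)
print(proportions)
```

Shuffling the label assignment while keeping cluster sizes fixed, and recomputing the proportions, would give the random-clustering baseline.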

Clustering  the  sample  of  WebKB  pages  produced  55  clusters 
varying  in  size  from  1-649  pages,  with  an  average  size  of  22. 
We  report  statistics  for  the  15  clusters  with  more  than  six 
pages,  listed  in  ascending  order  by  size.  Table  2  includes 
randomly  selected  URLs  from  four  clusters  for  subjective 
evaluation.  Recall  that  the  component  graph  only  contains 
pages  from  the  University  of  Wisconsin.  The  selected  clus¬ 
ters  appear  to  group  by  function — for  example,  tech  reports, 
course  pages,  or  research  group  pages. 

Figure 7a plots the intra-cluster attribute similarity averages compared to the
expected  averages  given  random  clusterings.  Figure  7b  shows 
the  actual  and  expected  proportion  of  intra-cluster  hyper¬ 
links.  The  proportion  of  intra-cluster  linkage  is  significantly 
higher  than  expected,  but  notice  that  the  largest  cluster’s 
(#15)  expected  linkage  is  quite  high  by  random  chance. 
This  may  indicate  that  the  largest  cluster  contains  a  set 
of  pages  that  are  too  tightly  connected  to  partition.  This 
clustering  does  exhibit  significantly  higher  than  expected  at¬ 
tribute  similarity.  However,  we  note  that  the  algorithm  is 
still  able  to  cluster  pages  into  groups  that  are  highly  inter¬ 
connected.  This  indicates  that  the  LinkAsFilter  metric  may 
be  robust  to  irrelevant  attribute  values. 

Clustering  the  sample  of  genes  produced  88  clusters  varying 
in  size  from  1-140  genes,  with  an  average  size  of  8.  We  report 
statistics  for  the  14  clusters  with  more  than  six  genes.  Intra¬ 
cluster  attribute  similarity  (figure  8a)  and  intra-cluster  link¬ 
age  (figure  8b)  are  both  significantly  higher  than  expected. 
These  results  show  that  the  LinkAsFilter  metric  can  be  used 
to  find  groups  of  genes  with  similar  functions  and  many  com¬ 
mon  interactions. 

The  structure  of  genomic  data  offers  an  opportunity  for  an 
objective  evaluation  of  the  clustering  results.  Clusters  of 
inter-connected  genes  with  similar  associated  functions  may 
indicate  a  group  of  genes  that  are  interacting  to  perform  a 
particular  function  in  the  cell.  If  this  is  the  case,  the  cluster 
labels  should  be  helpful  in  predicting  gene  localization  in  the 
cell.  To  test  this  hypothesis,  we  used  the  cluster  labels  to 
predict  gene  localization.  We  applied  a  relational  Bayesian 
classifier  (RBC)  [15]  to  the  gene  data,  using  the  cluster  labels 
as  an  additional  attribute,  and  measured  change  in  accuracy. 
Figure 8d reports average 10-fold cross-validation accuracies for
RBC models learned using the cluster labels from the LinkOnly,
AttrOnly, and LinkAsFilter metrics. The baseline RBC model used
twelve attributes for prediction, including gene phenotype and
motif, and achieved an average accuracy of 66.3%. The RBC model
that included cluster labels from AttrOnly did not significantly
improve accuracy.³ The model that included cluster labels from
LinkOnly achieved a significant improvement in accuracy, with an
average of 68.4%, indicating that gene interactions alone are
helpful for predicting location. However, the model that included
cluster labels from LinkAsFilter achieved an average accuracy of
70.2%. This is a significant improvement over both LinkOnly and the
baseline RBC model without cluster labels, which demonstrates the
utility of clustering for communities using both attribute and link
information.

² We assessed significance using two-tailed t-tests, p < 0.05.

Figure 8: Evaluation of hybrid clusters in Gene.

6.  DISCUSSION 

This  paper  presents  a  hybrid  metric  for  spectral  clustering 
algorithms  that  exploits  both  attribute  information  and  link 
structure  to  improve  discovery  of  communities  in  relational 
data.  There  has  been  relatively  little  work  investigating 
clustering  techniques  for  relational  domains.  The  work  in 
this  area  has  focused  on  either  complex  generative  models 
with  latent  variables  [If,  20,  3],  or  augmented  clustering 
techniques  that  use  ad-hoc  similarity  metrics  to  incorporate 
both link and attribute information [14, 9]. Due to the complexity
of probabilistic relational models with latent variables, and the
sparsity of relational graphs, which enables the use of efficient
eigensolver techniques, we chose to explore extensions to spectral
clustering for relational domains.

The most closely related prior work is that of He, Ding, Zha, and
Simon [9], which uses a spectral graph-partitioning algorithm to
automatically identify topics in sets of retrieved web pages. This
approach uses a similarity measure specifically designed for
high-dimensional text domains with weighted co-citation links. We
differ from this work, and other research on hybrid spectral
algorithms, in our exploration of the characteristics that underlie
successful similarity metrics.

³ Again, significance was assessed using two-tailed t-tests,
p < 0.05.

We  have  set  up  a  framework  to  evaluate  different  similarity 
metrics  quantitatively  over  a  wide  range  of  relational  data 
sets.  Our  experiments  show  that  increasing  the  separation 
between  total  intra-cluster  and  inter-cluster  transition  prob¬ 
abilities  results  in  superior  performance  over  a  wide  range  of 
data  characteristics.  One  way  to  increase  the  separation  be¬ 
tween  cluster  transition  probabilities  is  to  drop  potentially 
noisy  information  from  consideration.  Using  this  approach, 
we  expect  the  LinkAsFilter  metric  will  successfully  recover 
groupings  over  a  wide  range  of  data  characteristics. 

There are two primary advantages to using the LinkAsFilter metric.
The first advantage is algorithm efficiency: there are $O(E)$
approximate eigensolver algorithms, and there are $O(n^{1.4})$
exact eigensolver algorithms for sparse matrices that can exploit
the sparse matrix structure produced by the metric. The second
advantage is the choice of $\alpha = 1$, which is independent of
data characteristics. We expect the metric will work well in any
dataset exhibiting community structure, provided there is enough
data to withstand the associated increase in variance. In small
datasets, where the size of the data cannot offset the increase in
variance, the application of balanced metrics (e.g., WtLinkAttr2)
may produce superior clusterings. In practice, however, this
approach is limited by the need to set $\alpha$ to balance the link
and attribute information.

With a way to evaluate each setting, an algorithm could search for
the best $\alpha$. Our analysis indicates that the “best” settings
will maximize the separation between the intra-cluster and
inter-cluster transition probabilities. We conjecture that the
eigenvector information (more specifically, the separation between
the means of distributions of the eigenvector values on either side
of the cut) can be used to approximate this information. We report
preliminary findings in support of this conjecture.

Figure 9a graphs the correlation between algorithm performance and
the separation of eigenvector-value distributions. We clustered
over the space of synthetic datasets described in section 4.1 using
20 different values of $\alpha$, chosen uniformly in the range
[0, 1]. We recorded (1) the accuracy of the clustering, and (2) the
distance between the means of the eigenvector-value distributions
on either side of the chosen cut (after the values were normalized
to unit range). Figure 9b shows performance when we set $\alpha$ by
maximizing the separation between the means of the
eigenvector-value distributions. Comparing this graph to figure 1,
we can see that this technique approaches the performance of the
LinkAsFilter metric. This is a promising direction to explore for
applications with little data, where the variance will be too high
to apply LinkAsFilter successfully.
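
The selection rule described above can be sketched as follows. The sketch assumes that, for each candidate $\alpha$, the eigenvector values and the chosen cut threshold have already been produced by the clustering step; the function names and the toy values are hypothetical.

```python
def eigenvector_separation(values, cut):
    """Distance between the side means after rescaling values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # constant vector: nothing to separate
    scaled = [(v - lo) / (hi - lo) for v in values]
    left = [s for s, v in zip(scaled, values) if v <= cut]
    right = [s for s, v in zip(scaled, values) if v > cut]
    if not left or not right:
        return 0.0  # the cut leaves one side empty
    return abs(sum(right) / len(right) - sum(left) / len(left))


def choose_alpha(candidates):
    """Pick the alpha whose eigenvector shows the largest mean separation.

    candidates maps alpha -> (eigenvector values, cut threshold).
    """
    best_alpha, best_sep = None, -1.0
    for alpha, (values, cut) in candidates.items():
        sep = eigenvector_separation(values, cut)
        if sep > best_sep:
            best_alpha, best_sep = alpha, sep
    return best_alpha


# Hypothetical eigenvector values for two settings of alpha, both cut at 0.
candidates = {
    0.0: ([-0.2, -0.1, 0.0, 0.1, 0.2, 0.3], 0.0),
    0.5: ([-0.5, -0.45, -0.4, 0.4, 0.45, 0.5], 0.0),
}
best_alpha = choose_alpha(candidates)
print(best_alpha)
```

In the toy input the second setting produces two tight groups of eigenvector values far apart, so it is preferred over the setting whose values are spread evenly.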

7.  CONCLUSIONS  AND  FUTURE  WORK 

We have analyzed the spectral decomposition algorithm from a
statistical perspective and shown that the successful hybrid
metrics use the link and attribute information to increase the
separation between noisy clusters. We have shown an empirical
connection between the distribution of transition probabilities and
algorithm performance, connecting both mean and variance to cluster
accuracy. Future work will compare this approach to latent-variable
relational models and explore complexity/efficiency tradeoffs
between the two techniques. Furthermore, we will attempt to derive
theoretical bounds on finite-sample performance, and explore
alternative optimization criteria for data with low mean
separation, where the NCut criterion prefers
high-precision/low-recall groupings.

Figure 9: Searching for $\alpha$ to use in the metric: (a)
correlation between separation of eigenvector values and accuracy
(corr = 0.71), and (b) cluster accuracy using $\alpha$ that
maximizes separation.

In  addition,  the  WebKB  results  suggest  an  alternative  clus¬ 
tering  task — clustering  data  that  exhibit  role  equivalence 
structure,  rather  than  community  structure.  Objects  that 
play  the  same  roles  in  a  graph  have  similar  attributes  and 
similar  link  patterns  but  may  not  actually  link  to  each  other. 
For  example,  faculty  pages  rarely  link  to  each  other  but  they 
consistently link to student and course pages. Current methods
for grouping data in this manner focus primarily on link
information  (e.g.,  [17]).  Extending  this  work  to  incorporate 
attribute  information  seems  an  exciting  direction  to  explore. 
APPENDIX

A.  PROOF  OF  THEOREM 

Theorem: Let $A = (A_1, A_2)$ be a partition of $V$. Let the
function $S(i,j)$ define the similarity measure between
$v_i, v_j \in V$. If, $\forall i,j,k$, $S(i,j)$ is conditionally
independent of $S(i,k)$ given node $i$, and
$E[P_{11}]E[P_{22}] \neq E[P_{12}]E[P_{21}]$, then $P$ has an
eigenvector that will converge to piecewise constant w.r.t. $A$ as
$|A_1|, |A_2| \to \infty$.

Proof. In order to simplify the calculations below, we assume that
the two clusters share the same distribution of intra- and
inter-cluster similarity values. The symmetry in attribute
parameters simplifies the analysis but is not necessary for
correctness. Let $\mu_{in}$ be the mean intra-cluster similarity
for nodes $i,j \in A_1$ or $i,j \in A_2$. Similarly, let
$\mu_{out}$ be the mean inter-cluster similarity for nodes
$i \in A_1$ and $j \in A_2$.

We can represent each entry in $W$ as a random variable. Consider
the entries of row $i$. The entries $W_{ij}$, $W_{ik}$ are not
independent because the similarity values are both based on node
$i$. However, conditioned on the state of $i$ (e.g., attribute
values of $i$), the entries can be viewed as independent random
variables if the state of $j$ is independent of the state of $k$.
This assumption corresponds to a generative model in which the
objects and links in the graph are conditionally independent given
the object cluster memberships.

We will calculate the expected intra- and inter-cluster transition
probabilities in $P$ as a ratio of sums of random variables. Let
$T^i_{in}$ be the total intra-cluster transition probability for
node $i$, where $i \in A_k$, $k \in \{1,2\}$, and let
$|A_k| = n_k$. Similarly, let $T^i_{out}$ be the total
inter-cluster transition probability, and $T^i_{all}$ be the total
transition probability. Then $P^i_{in}$ is the ratio of $T^i_{in}$
and $T^i_{all}$, and $P^i_{out}$ is the ratio of $T^i_{out}$ and
$T^i_{all}$.

The normalized transition probabilities in $P$ then correspond to
the ratio of two random variables (e.g., $T^i_{in}/T^i_{all}$),
which can be approximated using a truncated Taylor series
expansion. The expectations of the intra- and inter-cluster
normalized transition probabilities are below; the variances and
the analytical derivations are included in Section A.1.

\[
E[P^i_{in}] = E[T^i_{in}/T^i_{all}]
  \approx \frac{\mu_{T_{in}}}{\mu_{T_{all}}} \cdot
  \left[ 1 + \left[\frac{\sigma_{T_{all}}}{\mu_{T_{all}}}\right]^2
  - \frac{\sigma_{T_{in}T_{all}}}{\mu_{T_{in}}\mu_{T_{all}}} \right]
\]

\[
E[P^i_{out}] = E[T^i_{out}/T^i_{all}]
  \approx \frac{\mu_{T_{out}}}{\mu_{T_{all}}} \cdot
  \left[ 1 + \left[\frac{\sigma_{T_{all}}}{\mu_{T_{all}}}\right]^2
  - \frac{\sigma_{T_{out}T_{all}}}{\mu_{T_{out}}\mu_{T_{all}}} \right]
\]

where $\sigma_{XY}$ is the covariance of $X, Y$.

As $n_1, n_2 \to \infty$, it follows directly from the Law of Large
Numbers that the value of $T^i_{in}/T^j_{in} \to 1$ for
$i,j \in A_k$, since $T_{in}$ is a sum of independent random
variables with finite mean and variance. A similar argument holds
for $T_{out}$ and $T_{all}$. Now consider the normalized transition
probabilities for $P$. If, in the limit, the sums $T^i_{in}$ (and
$T^i_{out}$, $T^i_{all}$) converge to the same value for all
$i \in A_k$, then the normalized sums $P^i_{in}$ will converge to
the same value $P_{in}$ for all $i \in A_k$. A similar argument
holds for $P_{out}$.

As $n_1, n_2 \to \infty$, we can decompose the matrix $P$ into
$P = P' + \epsilon E$, where $P'$ is a matrix with constant
transition probabilities $P_{in}$ and $P_{out}$, and $E$ is a
perturbation matrix with $\|E\|_2 = 1$. Then by matrix perturbation
theory [8]:

\[
(P' + \epsilon E)\, x_i(\epsilon) = \lambda_i(\epsilon)\, x_i(\epsilon)
\]

where

\[
x_i(\epsilon) = x_i + \epsilon \sum_{j=1,\, j \neq i}^{n}
  \left\{ \frac{y_j^H E x_i}{(\lambda_i - \lambda_j)\, y_j^H x_j}\, x_j \right\}
  + O(\epsilon^2)
\]

and

\[
\lambda_i(\epsilon) = \lambda_i
  + \epsilon\, \frac{y_i^H E x_i}{y_i^H x_i} + O(\epsilon^2).
\]

Here $x_i$, $y_i$, and $\lambda_i$ are the right and left
eigenvectors, and the eigenvalues, of $P'$. As
$n_1, n_2 \to \infty$, $\epsilon \to 0$ and the eigenvectors of $P$
will converge to the eigenvectors of $P'$. Therefore the graph will
converge to a Markov chain with state space $A = (A_1, A_2)$, and
constant transition probabilities
$R_{11} = R_{22} = E[P^i_{in}]$, and
$R_{12} = R_{21} = E[P^i_{out}]$. If $R_{11} \neq R_{12}$, then $R$
will be non-singular, and by proposition 2 in [13], $P$ will have a
piecewise constant eigenvector w.r.t. $A$. □

A.1 Analytic Derivations

When $S(i,j)$ is conditionally independent of $S(i,k)$ given the
state of node $i$, the cluster transition probabilities are simply
sums of independent random variables. Using conditional expectation
($E[h(X,Y)] = E_X\{E[h(X,Y)|X]\}$), we can calculate the
expectation for $T^i_{in}$ based on the state of $i$, which we
refer to as $i_s$:

\begin{align*}
E[T^i_{in}] &= E\Big[\sum_{j \in A_k} S(i,j)\Big] \\
  &= \sum_{i_s} p(i_s) \cdot E\Big[\sum_{j \in A_k} S(i_s,j)\Big] \\
  &= \sum_{i_s} p(i_s) \cdot n_k \cdot E[S(i_s,j) \mid j \in A_k] \\
  &= n_k \cdot \sum_{i_s} p(i_s) \cdot \sum_{j_s} p(j_s) \cdot S(i_s,j_s) \\
  &= n_k \cdot \sum_{i_s} \sum_{j_s} p(i_s) \cdot p(j_s) \cdot S(i_s,j_s) \\
  &= n_k \cdot \mu_{in}
\end{align*}

Total inter-cluster and overall means are calculated in a similar
fashion: $E[T^i_{out}] = n_{k'} \cdot \mu_{out}$, and
$E[T^i_{all}] = (n_k \cdot \mu_{in}) + (n_{k'} \cdot \mu_{out})$,
where $n_{k'} = n_l$ with $l \neq k$ (the size of the other
cluster).


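The variance of the total intra-cluster similarity is calculated as
follows⁴:

\begin{align*}
Var[T^i_{in}] &= Var\Big[\sum_{j \in A_k} S(i,j)\Big] \\
  &= E_{i_s}\Big\{Var\Big[\sum_{j \in A_k} S(i_s,j)\Big]\Big\} \\
  &= \sum_{i_s} p(i_s) \cdot Var\Big[\sum_{j \in A_k} S(i_s,j)\Big] \\
  &= \sum_{i_s} p(i_s) \cdot n_k \cdot Var[S(i_s,j) \mid j \in A_k] \\
  &= n_k \cdot \sum_{i_s} \sum_{j_s} p(i_s) \cdot p(j_s) \cdot
     \{S(i_s,j_s) - E_{i_s}[S(i_s,j_s)]\}^2
\end{align*}

Total inter-cluster and overall variance are calculated in a
similar fashion:
$Var[T^i_{out}] = n_{k'} \cdot \sum_{i_s} p(i_s) \cdot
Var[S(i_s,j) \mid j \in A_{k'}]$, and
$Var[T^i_{all}] = \sum_{i_s} p(i_s) \{ n_{k'} \cdot
Var[S(i_s,j) \mid j \in A_{k'}] + n_k \cdot
Var[S(i_s,j) \mid j \in A_k] \}$.

⁴ The derivation uses the following equivalence:

\begin{align*}
Var(h(X,Y)) &= E[h(X,Y)^2] - E[h(X,Y)]^2 \\
  &= E_X\{E[h(X,Y)^2 \mid X]\} - E_X\{E[h(X,Y) \mid X]^2\} \\
  &= E_X\{Var(h(X,Y) \mid X)\}
\end{align*}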
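
The mean and variance calculations above can be checked on a toy case. The sketch below assumes a discrete set of node states with probabilities `p` and a similarity function `sim` over states; both, along with the function names, are hypothetical choices for illustration. It evaluates $n_k \cdot \mu_{in}$ and $\sum_{i_s} p(i_s) \cdot n_k \cdot Var[S(i_s,j) \mid j \in A_k]$ directly.

```python
def expected_total_similarity(n_k, p, sim):
    """E[T_in] = n_k * sum over (is, js) of p(is) * p(js) * S(is, js)."""
    mu_in = sum(p[a] * p[b] * sim(a, b) for a in p for b in p)
    return n_k * mu_in


def variance_total_similarity(n_k, p, sim):
    """Var[T_in] = sum over is of p(is) * n_k * Var[S(is, j) | j in A_k]."""
    total = 0.0
    for a in p:
        # Mean and variance of S(a, j) over the state of j, with a fixed.
        mean_a = sum(p[b] * sim(a, b) for b in p)
        var_a = sum(p[b] * (sim(a, b) - mean_a) ** 2 for b in p)
        total += p[a] * n_k * var_a
    return total


def match(a, b):
    """Hypothetical similarity: 1 if the two states agree, else 0."""
    return 1.0 if a == b else 0.0


# Hypothetical cluster with two equally likely attribute states.
p = {"x": 0.5, "y": 0.5}
mean_t_in = expected_total_similarity(10, p, match)
var_t_in = variance_total_similarity(10, p, match)
print(mean_t_in, var_t_in)
```

With ten neighbors that each match with probability one half, the total behaves like a binomial count, which gives a quick sanity check on both values.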

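From these we can calculate the expected transition probabilities
of $P$ using the ratio of two random variables (e.g.,
$T_{in}/T_{all}$). These calculations use an approximation of the
ratio of two random variables, based on a truncated Taylor series
expansion:

\[
E[X/Y] \approx \frac{\mu_X}{\mu_Y} \cdot
  \left[ 1 + \left[\frac{\sigma_Y}{\mu_Y}\right]^2
  - \frac{\sigma_{XY}}{\mu_X \mu_Y} \right]
\]

\[
Var(X/Y) \approx \left[\frac{\mu_X}{\mu_Y}\right]^2 \cdot
  \left[ \left[\frac{\sigma_X}{\mu_X}\right]^2
  + \left[\frac{\sigma_Y}{\mu_Y}\right]^2
  - 2\,\frac{\sigma_{XY}}{\mu_X \mu_Y} \right]
\]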
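
The two approximations can be written as small helper functions. This is a direct transcription of the truncated Taylor formulas for a ratio of random variables; the function and argument names are chosen here for illustration.

```python
def ratio_mean_approx(mu_x, mu_y, var_y, cov_xy):
    """Second-order Taylor approximation of E[X/Y]."""
    return (mu_x / mu_y) * (1.0 + var_y / mu_y ** 2 - cov_xy / (mu_x * mu_y))


def ratio_var_approx(mu_x, mu_y, var_x, var_y, cov_xy):
    """First-order Taylor approximation of Var(X/Y)."""
    return (mu_x / mu_y) ** 2 * (
        var_x / mu_x ** 2 + var_y / mu_y ** 2 - 2.0 * cov_xy / (mu_x * mu_y)
    )


# Degenerate check: no randomness, so the ratio is exactly mu_x / mu_y.
print(ratio_mean_approx(2.0, 4.0, 0.0, 0.0))
# Degenerate check: X equals Y, so the ratio is the constant 1.
print(ratio_mean_approx(4.0, 4.0, 1.0, 1.0),
      ratio_var_approx(4.0, 4.0, 1.0, 1.0, 1.0))
```

The approximations assume $\mu_X$ and $\mu_Y$ are nonzero and that $Y$ stays well away from zero, which holds for the positive similarity totals considered here.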


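The expectation and variance for intra- and inter-cluster
normalized transition probabilities are as follows:

\[
E[P^i_{in}] = E[T^i_{in}/T^i_{all}]
  \approx \frac{\mu_{T_{in}}}{\mu_{T_{all}}} \cdot
  \left[ 1 + \left[\frac{\sigma_{T_{all}}}{\mu_{T_{all}}}\right]^2
  - \frac{\sigma_{T_{in}T_{all}}}{\mu_{T_{in}}\mu_{T_{all}}} \right]
\]

\[
Var[P^i_{in}] = Var[T^i_{in}/T^i_{all}]
  \approx \left[\frac{\mu_{T_{in}}}{\mu_{T_{all}}}\right]^2 \cdot
  \left[ \left[\frac{\sigma_{T_{in}}}{\mu_{T_{in}}}\right]^2
  + \left[\frac{\sigma_{T_{all}}}{\mu_{T_{all}}}\right]^2
  - 2\,\frac{\sigma_{T_{in}T_{all}}}{\mu_{T_{in}}\mu_{T_{all}}} \right]
\]

\[
E[P^i_{out}] = E[T^i_{out}/T^i_{all}]
  \approx \frac{\mu_{T_{out}}}{\mu_{T_{all}}} \cdot
  \left[ 1 + \left[\frac{\sigma_{T_{all}}}{\mu_{T_{all}}}\right]^2
  - \frac{\sigma_{T_{out}T_{all}}}{\mu_{T_{out}}\mu_{T_{all}}} \right]
\]

\[
Var[P^i_{out}] = Var[T^i_{out}/T^i_{all}]
  \approx \left[\frac{\mu_{T_{out}}}{\mu_{T_{all}}}\right]^2 \cdot
  \left[ \left[\frac{\sigma_{T_{out}}}{\mu_{T_{out}}}\right]^2
  + \left[\frac{\sigma_{T_{all}}}{\mu_{T_{all}}}\right]^2
  - 2\,\frac{\sigma_{T_{out}T_{all}}}{\mu_{T_{out}}\mu_{T_{all}}} \right]
\]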

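where $\sigma_{XY}$ is the covariance of $X, Y$. For the equations
above, the covariance of $T_{in}$ and $T_{all}$ reduces to the
variance of $T_{in}$, using conditional expectation to eliminate
the covariance:

\begin{align*}
\sigma_{T_{in}T_{all}} &= E[T_{in}T_{all}] - E[T_{in}] \cdot E[T_{all}] \\
  &= E[T_{in}(T_{in} + T_{out})] - E[T_{in}] \cdot E[(T_{in} + T_{out})] \\
  &= E[T_{in}^2 + T_{in} \cdot T_{out}] - E[T_{in}]^2
     - E[T_{in}] \cdot E[T_{out}] \\
  &= E[T_{in}^2] + E[T_{in} \cdot T_{out}] - E[T_{in}]^2
     - E[T_{in}] \cdot E[T_{out}] \\
  &= E[T_{in}^2] - E[T_{in}]^2 + E[T_{in} \cdot T_{out}]
     - E[T_{in}] \cdot E[T_{out}] \\
  &= Var(T_{in}) + \sum_{i_s} p(i_s) \{ E[T_{in} \cdot T_{out} \mid i]
     - E[T_{in} \mid i] \cdot E[T_{out} \mid i] \} \\
  &= Var(T_{in}) + \sum_{i_s} p(i_s) \cdot 0 \\
  &= Var(T_{in})
\end{align*}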


A similar derivation applies to the covariance of $T_{out}$ and
$T_{all}$.
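
As a worked step, substituting the covariance results (the covariance of each total with $T_{all}$ equals that total's own variance) into the ratio approximations gives the following forms. This is only a restatement under the derivations above, written with $\sigma^2_{T}$ for $Var(T)$.

```latex
\begin{align*}
E[P^i_{in}] &\approx \frac{\mu_{T_{in}}}{\mu_{T_{all}}} \cdot
  \left[ 1 + \frac{\sigma^2_{T_{all}}}{\mu^2_{T_{all}}}
  - \frac{\sigma^2_{T_{in}}}{\mu_{T_{in}}\mu_{T_{all}}} \right] \\
Var[P^i_{in}] &\approx \left[\frac{\mu_{T_{in}}}{\mu_{T_{all}}}\right]^2 \cdot
  \left[ \frac{\sigma^2_{T_{in}}}{\mu^2_{T_{in}}}
  + \frac{\sigma^2_{T_{all}}}{\mu^2_{T_{all}}}
  - 2\,\frac{\sigma^2_{T_{in}}}{\mu_{T_{in}}\mu_{T_{all}}} \right] \\
E[P^i_{out}] &\approx \frac{\mu_{T_{out}}}{\mu_{T_{all}}} \cdot
  \left[ 1 + \frac{\sigma^2_{T_{all}}}{\mu^2_{T_{all}}}
  - \frac{\sigma^2_{T_{out}}}{\mu_{T_{out}}\mu_{T_{all}}} \right] \\
Var[P^i_{out}] &\approx \left[\frac{\mu_{T_{out}}}{\mu_{T_{all}}}\right]^2 \cdot
  \left[ \frac{\sigma^2_{T_{out}}}{\mu^2_{T_{out}}}
  + \frac{\sigma^2_{T_{all}}}{\mu^2_{T_{all}}}
  - 2\,\frac{\sigma^2_{T_{out}}}{\mu_{T_{out}}\mu_{T_{all}}} \right]
\end{align*}
```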


